You probably noticed that when you get a bill or bank statement, it uses the current modern Western version of Hindu-Arabic numerals. Yet during the early days of the Renaissance, the accounting firms in Italy often boasted that they kept their records using “God’s own, well understood Roman numerals” for their accounting.
Your choice of how you encode data can be significant. With computers and databases, you have a wide range of options. As a programmer, you will likely be called upon at some point in your career to design an encoding scheme for a particular application. It would really help if you at least avoided designing a bad one.
But what are the kinds of bad encoding schemes?
General Encoding Scheme Issues
In this section, I will present several factors that you need to consider when deciding how to encode data in your databases and applications.
Ambiguous Codes
Unless you are writing a murder mystery, where not knowing exactly what some piece of writing means adds to the air of mystery, ambiguity is generally a bad thing.
When creating codes that you will store to represent some real-world value, we will probably all agree that you don’t want ambiguous codes. There are two kinds of ambiguity to be concerned with. Does one code stand for two totally different values or entities? Or do two totally different encodings indicate the same value or entity? Please note that neither of these situations has anything to do with misread data.
Rather, this covers situations like when two different books had the same ISBN. This happened because the string of 10 digits that make up the ISBN can be parsed several different ways to get the language, publisher, and the book within the publisher substrings.
Later in the article, we will get into techniques for creating consistently readable encodings by choosing the right character set and using check digits.
Limited Growth Range
A particularly bad design is an encoding scheme that does not allow for reasonable, realistic growth. Sometimes this is bad planning, but sometimes things just grow bigger than you think they’re going to. Decades ago, American Honda assigned five-digit dealership numbers because they thought they would never get bigger than that. Those systems used COBOL, an older programming language with fixed-length fields, so adding extra digits required rewriting a large amount of code.
Sometimes, as a user, you may question why something that seems quite limited (like an account number for a small company) has 10 digits. Part of this is to allow for growth and check digits, and part of it may be that the software was purchased off the shelf, built for a variety of purposes, and this is just a small company using it.
Missing Data Problems
Missing data presents its own problems. There are well over a dozen reasons why data can be missing, and sometimes we really need to know those reasons (unknown, not applicable, miscellaneous, erroneous value, etc.). Do you assume a default? Do you generate a value (there are various statistical methods for doing this in some circumstances)?
As you are thinking about your method of encoding data, being able to identify the cause of missing data becomes quite valuable for users.
Encoding Scheme Types
There are various types of encoding schemes and virtually all of them see pretty regular use. In this section I will cover a selection of these schemes.
Enumeration
The simplest encoding is enumeration. Set up a list of the values of the attributes you want to keep in your database and then decide whether to represent them with numbers or letters. Database programmers who are new to SQL often use the IDENTITY property or some other auto-numbering because it’s quick and easy. However, such autogenerated values have a lot of downsides when used as simple integer values.
It’s much better to take a little time and plan out how your values will look. Just as a matter of design, leading zeros make an encoding look more like an encoding and less like a quantity. For example, you are pretty sure that “042” is not a quantity or magnitude.
Numeric, alphabetic, and alphanumeric codes can be simply sorted as if they were strings. It is also nice if these enumerations have some sort of logical ordering, so you might want to look at putting them in temporal order, procedural order, or something else that follows how the encoding is used.
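As a minimal sketch (in Python, just for illustration), zero-padded codes keep their logical order when sorted as strings, while bare integer strings do not:

```python
# Zero-padded enumeration codes sort correctly as strings,
# while bare integers stored as strings fall into lexicographic order.
codes_padded = ["042", "007", "100", "003"]
codes_bare = ["42", "7", "100", "3"]

print(sorted(codes_padded))  # ['003', '007', '042', '100'] -- matches numeric order
print(sorted(codes_bare))    # ['100', '3', '42', '7'] -- string order, not numeric

# Generating a fixed-width code with leading zeros:
new_code = f"{42:03d}"  # '042'
```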
Measurements
Many of the attributes will be measurements taken on well-defined, industry standard scales. However, it’s really important to know what the units are. You can Google around for horror stories about people mixing up gallons and liters or US customary measure with metric units.
Another thing to watch for is how many decimal places you wish to use to store the data. If the data is in an SI unit of measure (Système International d’Unités, the proper, modern name for what we call “metric” in the States), you can use a DECIMAL(p, s) data type, choosing a precision and scale with an extra place to allow for rounding when you’re doing arithmetic.
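Here is a small sketch of the “extra place for rounding” idea using Python’s decimal module; the choice of two display places and three stored places is just an assumption for the example:

```python
from decimal import Decimal, ROUND_HALF_UP

# Store one more decimal place than you display, so intermediate
# arithmetic rounds cleanly. (Two display places is an assumption here.)
raw = Decimal("12.3456")
stored = raw.quantize(Decimal("0.001"), rounding=ROUND_HALF_UP)      # 12.346
displayed = stored.quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)  # 12.35
print(stored, displayed)
```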
Abbreviation Encoding
Abbreviation encodings are generally best done with a fixed length string. If you’re old enough to remember when there was no ZIP Code, the state of California was “Calif” for its postal abbreviation, while Pennsylvania was usually just PA or Penn. The modern two letter state and province codes were both concise and a great improvement over the previous conventions.
The only problem with abbreviation encodings is that if you want to keep them to a fixed length, you can run out of simple, obvious possibilities. For example, consider the current airport codes. It’s pretty easy to figure out BOS for “Boston” and ATL for “Atlanta”, but as they get a bit less obvious, like LAX for “Los Angeles”, they can become a bit more problematic. Some of these codes don’t make much sense at all initially. For example, BNA for Nashville, or even WKK for “Aleknagik, Alaska”, is not obvious to most people.
The big issue is when your abbreviation is not based on what you expect. For example, BNA actually comes from the airport’s original name, Berry Field Nashville, named after Colonel Harry S. Berry.
Hence, it can be important to really understand your use of such abbreviations and how likely it is that the person using them will need to decode them to understand what they mean.
Algorithmic Encoding
Hashing and other data transforms are a type of encoding. Usually hashing is done to speed up access, much like indexing. But if the hashing is exposed in the data rather than kept only by the system, then it can become part of the data.
This is a topic in itself, and I really don’t want to go into it too deeply. But it is worth mentioning that if the data doesn’t change, there is a nice proof that you can always construct a minimal perfect hashing function, which can be valuable.
This means that the hash table has no collisions (“hash clash”) and no empty spots. If this doesn’t mean anything to you, then I’m going to tell you to go back to one of your old textbooks. As an aside, if your database uses hashing rather than indexing for data access, then there are proofs for the maximum number of rehashes required to locate a single record for a database of any size.
Another thing you might not think of as an encoding is encryption. That is also a topic in itself, and I won’t go into it here.
Hierarchical Encoding
As humans, we really like hierarchies. They organize data for us in ways that let us handle incredibly large amounts of it. They are also very versatile. The ZIP Code system in the US, and similar ones in other countries, are based on geography. They don’t actually have to be numeric; being numeric just makes them much easier to work with.
In the case of ZIP Codes, we can look at a five digit number with leading zeros and see that at least it’s a valid ZIP Code by simple inspection; even if it might not be an actual ZIP Code when we check it against which numbers are currently in use. Other hierarchical encoding systems you might have run into are ICD (International Classification of Diseases) and several other medical systems. However, the one you most likely know and have used is the Dewey Decimal Classification system for books used in English language libraries.
There are some problems with this type of encoding. Will all values fit into the scheme? Expanding part of a hierarchy can be awkward. What you thought might have been a minor branch of the hierarchy can turn out to be quite a bit larger than originally anticipated. Getting enough coverage for something as general as all possible subjects in a library can be difficult.
The Dewey Decimal system, which readers of a certain age will recognize from their days in the library, has actually been the subject of more politics than you would think. There is the obvious problem of different political organizations wanting to have their opponents classified in a less favorable light. Think about China wanting Taiwan to be classified as “an outlaw province of China” and all types of similar things. The pseudosciences want to fit into the 500s (the natural sciences) to gain respectability.
Vector Encoding
Vector encoding is often confused with a hierarchical one. With vector encoding, the code is broken into fields, separated either by punctuation, a known field size or a change in the character set. The important thing is that the fields are relatively independent of each other, have some meaning by themselves, and that they only make sense when taken together. In Europe, these encoding schemes are also known as facet codes.
A version of this type of encoding is becoming more and more important in the AI field.
The term “field” is Standard SQL, and the usual example in SQL is the DATE data type. A date breaks down into YEAR, MONTH, and DAY fields. Standard SQL uses the ISO 8601 format for temporal data types – yyyy-mm-dd. While the dashes separate the fields with punctuation, the fields are actually of fixed length, so the punctuation is not really needed (it just makes a date easier to read and parse).
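A quick sketch (in Python, for illustration) of how an ISO 8601 date splits into its fields, with or without the dashes:

```python
from datetime import date

# With punctuation: the standard library parses the ISO 8601 form directly.
d = date.fromisoformat("2024-07-04")
print(d.year, d.month, d.day)  # 2024 7 4

# Without punctuation: the fields are fixed length, so slicing still works.
s = "20240704"
year, month, day = int(s[0:4]), int(s[4:6]), int(s[6:8])
```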
Another common example is ISO tire sizes. These codes also break into fields, representing things like the tire’s width, aspect ratio, construction type, and rim diameter. Again, a tire has to have all of its subfields in order to have physical existence.
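A vector code like this can be split into its fields mechanically. This sketch assumes the common metric tire-size format (e.g. “225/45R17”: width in mm, aspect ratio, a construction letter, and rim diameter in inches); the field names are just illustrative:

```python
import re

# Fields are marked by punctuation (/), fixed widths, and a character-set
# change (the construction letter), exactly as vector codes are described.
TIRE = re.compile(r"^(?P<width>\d{3})/(?P<aspect>\d{2})(?P<construction>[A-Z])(?P<rim>\d{2})$")

m = TIRE.match("225/45R17")
print(m.groupdict())
# {'width': '225', 'aspect': '45', 'construction': 'R', 'rim': '17'}
```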
Concatenation Encoding
A concatenation encoding is built from a string of values that is extended as the attribute is extended. It’s something like a vector code, but it doesn’t start off with a fixed number of fields. Frankly, it’s not very popular anymore. These codes used to be used in machine shops for tracking piecework from one workstation to the next, and they stayed popular in the aircraft industry for machine shop parts.
Originally, you had a large paper tag attached to each part. As each step in the manufacturing process was finished, it was noted on the tag by the machinist or their supervisor. The goal was to get an audit history for each part. Today, we would probably use a scanner, barcodes, and the database.
Authority Encoding
Authority encodings depend on having an authority that issues them and tracks them for you. This is where things get a little bit vague. For example, ZIP Codes are controlled by the U.S. Postal Service. You can go online or use a directory to find out if a ZIP Code has been issued and which post office or town it references. But a ZIP Code can also be validated, or partially validated, in and of itself. For example, a ZIP Code of the format 3xxxx is in the southeast of the United States. If it has the format 303xx, it’s in Atlanta. Finally, the ZIP Code 30310 belongs to the West End station in Atlanta and covers neighborhoods in that part of the city.
ZIP Codes don’t map nicely to every city and area; there are even a few ZIP Codes that cross state lines, and many that include multiple towns.
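The hierarchical reading described above can be sketched as a simple prefix test (the function name and the fallback string here are just illustrative):

```python
# Reading a ZIP Code hierarchically, using the example prefixes above:
# 3xxxx = southeastern US, 303xx = Atlanta, 30310 = West End station.
def zip_region(zipcode: str) -> str:
    if zipcode == "30310":
        return "West End station, Atlanta"
    if zipcode.startswith("303"):
        return "Atlanta"
    if zipcode.startswith("3"):
        return "Southeastern United States"
    return "elsewhere"

print(zip_region("30310"))
```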
Now consider the EIDR encoding, which stands for Entertainment Identifier Registry. It is a long string of digits with some embedded punctuation. It references a public registry that provides a unique identifier for audiovisual content across the global entertainment industry. It was established to address the need for a standardized identification system that can uniquely identify and track various forms of entertainment content. There is a check digit in it, but the whole code is not something you’re going to easily write yourself. Your company signs up, becomes a member of this organization, and codes are assigned to you by this defining authority.
The Data is Encoded, How Do I Know I Got It Right?
Once you’ve got the coding system designed, how do you know if you have it right? If you’re using an authority encoding, you just go to that authority. This obviously is not practical for everything, but there are some methods to assure data quality at the database level.
Please don’t confuse validation and verification. ZIP Code 99951 doesn’t even exist (at least not as of the publishing of this article!). In fact, the highest real ZIP Code is 99950, for Ketchikan, Alaska. But the code has five digits and looks correct.
It passes simple validation (zipcode CHAR(5) NOT NULL CHECK (zipcode LIKE '[0-9][0-9][0-9][0-9][0-9]')) and fails verification when you try to use it for mail.
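The same distinction is easy to demonstrate outside the database. This Python sketch uses a regular expression equivalent to that five-digit format check:

```python
import re

# Validation only: this confirms the *format* of the code,
# not that the code is actually in use.
ZIP_PATTERN = re.compile(r"^[0-9]{5}$")

print(bool(ZIP_PATTERN.match("99951")))  # True: the format is valid...
# ...but 99951 is not an issued ZIP Code, so it would still fail
# verification the moment you tried to mail something to it.
```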
Numeric Ranges
If an attribute is a numeric value, then you can add a simple CHECK constraint in the DDL when you create tables. This constraint needs to consider whether the attribute has to be a member of a set, with an IN() predicate, or fall within a range, with a BETWEEN predicate. For example, if you had a code that represented the pain a patient felt (the numeric representation of the pain scale they use with the emojis from frowning to smiling, which runs from 1-10), you could write IN (1,2,3,4,5,6,7,8,9,10) or, because BETWEEN is inclusive, BETWEEN 1 AND 10.
Note that sometimes it can be better to create a table to represent values and use a FOREIGN KEY constraint, since the values may have additional data that can be useful to the users of the database.
An example of the second situation might be “quantity BETWEEN 0 AND 500”; however, please note that people tend to forget that integers can be negative and fail to check the lower bound when validating values! I once dealt with a situation where negative quantities in an order would result in refunds in the front-end program. It was a nice piece of embezzling.
Regular Expressions
Various SQL products support a pretty wide range of different regular expressions. For purposes of a CHECK() constraint, it’s probably best to stick to the absolute minimum that you can use in every product: the _ wildcard, the % wildcard, and the singleton character select list. This should be enough.
Regular expressions can be used to check the format of values rather easily, but they typically should not be used for extremely complex checks unless the pattern has been proved correct mathematically, or at least very well tested against as many values as possible.
Check Digits
Using a check digit in your encoding scheme can help stop users from accidentally landing on another valid value, by making the valid values non-contiguous. A check digit is an extra character (usually a single digit) added to a value, typically tacked on the end of the string. It is computed from all the preceding characters in that string; if the computation matches the check digit, we are pretty sure that what was input is correct. Check digits are typically not security mechanisms, as they are easily deciphered by anyone. They are there to make sure that you are unlikely to mistype one value and get another real value.
A common example is the Luhn algorithm, used in a lot of barcode and credit card systems. It was patented by IBM and named after its inventor. You can play with it at https://simplycalc.com/luhn-calculate.php. It works by assigning weights to the digits in a number, doing a summation, and then taking mod 10 of that result, which guarantees the check digit is always a single digit.
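Here is a minimal Python sketch of the standard Luhn computation (the article is SQL-oriented; Python is used here purely for illustration):

```python
def luhn_check_digit(number: str) -> int:
    """Compute the Luhn check digit for a string of digits."""
    total = 0
    # Walk right to left; double every second digit, subtracting 9
    # when the doubled value exceeds 9 (equivalent to summing its digits).
    for i, ch in enumerate(reversed(number)):
        d = int(ch)
        if i % 2 == 0:
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return (10 - total % 10) % 10

def luhn_valid(number_with_check: str) -> bool:
    """Validate a number whose last digit is its Luhn check digit."""
    return luhn_check_digit(number_with_check[:-1]) == int(number_with_check[-1])

print(luhn_check_digit("7992739871"))  # 3 -> the full number is 79927398713
print(luhn_valid("79927398713"))       # True
```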
This sort of checking should be done in the input tier of your tiered architecture, rather than in the database tier. If you’ve got an invalid code, do not insert it into the database; let the user know it needs to be fixed before going any further.
There are other check digit algorithms that catch more errors. For example, you can use weights of (1, 3, 7) instead of (1, 2). The most effective one that I know of is the dihedral five (Verhoeff) algorithm. It catches a large number of errors, but the math is based on group theory, and it gets pretty elaborate.
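The weighted variant is easy to sketch. This example uses the (1, 3, 7) weights mentioned above, cycled across the digits; note that schemes differ on details such as whether the check digit is the weighted sum mod 10 or its complement, so treat this as one possible formulation:

```python
def check_digit_137(number: str) -> int:
    """Weighted mod-10 check digit using cycling weights (1, 3, 7)."""
    weights = (1, 3, 7)
    total = sum(int(ch) * weights[i % 3] for i, ch in enumerate(number))
    return total % 10

# 1*1 + 2*3 + 3*7 + 4*1 + 5*3 = 47, so the check digit is 7.
print(check_digit_137("12345"))  # 7
```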
Note: Please don’t confuse error detecting codes, like check digits, with error correcting codes. Error correcting codes usually require extra bits added to each byte at the hardware level.
Conclusion
And you thought that normalization was hard! Just be grateful that the organizations that work on industry standards have people watching out for this stuff, so that you don’t have to when you get to your database.